Categories and curves
Workspace setup:
As we develop more useful models, we’ll begin to practice the art of generating models with multiple estimands. An estimand is a quantity we want to estimate from the data. Our models may not themselves produce the answer to our central question, so we need to know how to calculate these values from the posterior distributions.
This is going to be different from prior regression courses (PSY 612), where our models were often designed to give us precisely what we wanted. For example, consider the regression:
\[ \hat{Y} = b_0 + b_1(D) \] Where \(Y\) is a continuous outcome and \(D\) is a dummy coded variable (0 = control; 1 = treatment).
Forget dummy codes. From here on out, we will incorporate categorical causes into our models by using index variables. An index variable contains integers that correspond to different categories. The numbers have no inherent meaning – rather, they stand as placeholders or shorthand for categories.
data("Howell1")
d <- Howell1
library(measurements)
d$height <- conv_unit(d$height, from = "cm", to = "feet")
d$weight <- conv_unit(d$weight, from = "kg", to = "lbs")
d <- d[d$age >= 18, ]
d$sex <- ifelse(d$male == 1, 2, 1) # 1 = female, 2 = male
head(d[, c("male", "sex")]) male sex
1 1 2
2 0 1
3 0 1
4 1 2
5 0 1
6 1 2
Let’s write a mathematical model to express weight in terms of sex.
\[\begin{align*} w_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{SEX[i]} \\ \alpha_j &\sim \text{Normal}(130, 20)\text{ for }j = 1..2 \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]
quap() mean sd 5.5% 94.5%
a[1] 92.25778 0.8841999 90.84465 93.67090
a[2] 107.17390 0.9411695 105.66973 108.67807
sigma 12.10285 0.4561562 11.37382 12.83188
Here, we are given the estimates of the parameters specified in our model: the average weight of women (a[1]) and the average weight of men (a[2]). But our question is whether these average weights are different. How do we get that?
List of 2
$ sigma: num [1:10000] 11 12.1 11.7 12.5 11.8 ...
$ a : num [1:10000, 1:2] 91.3 92.3 93 90.9 92.1 ...
- attr(*, "source")= chr "quap posterior: 10000 samples from m1"
[,1] [,2]
[1,] 91.33986 108.7891
[2,] 92.25883 107.4387
[3,] 93.04045 107.2987
[4,] 90.90761 108.3809
[5,] 92.08847 106.5157
[6,] 92.88008 106.5985
mean sd 5.5% 94.5% histogram
sigma 12.10268 0.4508951 11.38862 12.81747 ▁▁▂▇▇▃▁▁
a[1] 92.23226 0.8899818 90.78476 93.64546 ▁▁▁▁▂▃▇▇▇▃▂▁▁▁▁
a[2] 107.17433 0.9455844 105.65103 108.69716 ▁▁▂▇▇▃▁▁
diff_fm -14.94207 1.2950456 -17.03328 -12.87254 ▁▁▁▃▇▇▃▂▁▁▁
We can create two plots. One is the posterior distributions of average female and male weights and one is the average difference.
p1 <- post %>% as.data.frame() %>%
pivot_longer(starts_with("a")) %>%
mutate(sex = ifelse(name == "a.1", "female", "male")) %>%
ggplot(aes(x=value, color = sex)) +
geom_density(linewidth = 2) +
labs(x = "weight(lbs)")
p2 <- post %>% as.data.frame() %>%
ggplot(aes(x=diff_fm)) +
geom_density(linewidth = 2) +
labs(x = "difference in weight(lbs)")
( p1 | p2)A note that the distributions of the mean weights is not the same as the distribution of weights period. For that, we need the posterior predictive distributions.
pred_f <- rnorm(1e4, mean = post$a[,1], sd = post$sigma )
pred_m <- rnorm(1e4, mean = post$a[,2], sd = post$sigma )
pred_post = data.frame(pred_f, pred_m) %>%
mutate(diff = pred_f-pred_m)
# plot distributions
p1 <- pred_post %>% pivot_longer(starts_with("pred")) %>%
mutate(sex = ifelse(name == "pred_f", "female", "male")) %>%
ggplot(aes(x = value, color = sex)) +
geom_density(linewidth = 2) +
labs(x = "weight (lbs)")
# plot difference
# Compute density first
density_data <- density(pred_post$diff)
# Convert to a tibble for plotting
density_df <- tibble(
x = density_data$x,
y = density_data$y,
fill_group = ifelse(x < 0, "male", "female") # Define fill condition
)
# Plot with area fill
p2 <- ggplot(density_df, aes(x = x, y = y, fill = fill_group)) +
geom_area() + # Adjust transparency if needed
geom_line(linewidth = 1.2, color = "black") + # Keep one continuous curve
labs(x = "Difference in weight (F-M)", y = "density") +
guides(fill = "none")
(p1 | p2)In the rethinking package, the dataset milk contains information about the composition of milk across primate species, as well as some other facts about those species. The taxonomic membership of each species is included in the variable clade; there are four categories.
kcal.per.g). 1'data.frame': 29 obs. of 8 variables:
$ clade : Factor w/ 4 levels "Ape","New World Monkey",..: 4 4 4 4 4 2 2 2 2 2 ...
$ species : Factor w/ 29 levels "A palliata","Alouatta seniculus",..: 11 8 9 10 16 2 1 6 28 27 ...
$ kcal.per.g : num 0.49 0.51 0.46 0.48 0.6 0.47 0.56 0.89 0.91 0.92 ...
$ perc.fat : num 16.6 19.3 14.1 14.9 27.3 ...
$ perc.protein : num 15.4 16.9 16.9 13.2 19.5 ...
$ perc.lactose : num 68 63.8 69 71.9 53.2 ...
$ mass : num 1.95 2.09 2.51 1.62 2.19 5.25 5.37 2.51 0.71 0.68 ...
$ neocortex.perc: num 55.2 NA NA NA NA ...
\[\begin{align*} K_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{\text{CLADE}[i]} \\ \alpha_i &\sim \text{Normal}(0, 0.5) \text{ for }j=1..4 \\ \sigma &\sim \text{Exponential}(1) \\ \end{align*}\]
Exercise: Now fit your model using quap(). It’s ok if your mathematical model is a bit different from mine.
flist <- alist(
K ~ dnorm( mu , sigma ) ,
mu <- a[clade_id] ,
a[clade_id] ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
)
m2 <- quap(
flist, data = milk
)
precis( m2, depth=2 ) mean sd 5.5% 94.5%
a[1] -0.4843499 0.21764135 -0.83218279 -0.1365170
a[2] 0.3662533 0.21705907 0.01935096 0.7131556
a[3] 0.6752225 0.25753421 0.26363311 1.0868119
a[4] -0.5858096 0.27450917 -1.02452826 -0.1470909
sigma 0.7196460 0.09653364 0.56536663 0.8739254
rethinkingPlot the following distributions:
post <- extract.samples( m2 )
names(labels) = paste("a.", 1:4, sep = "")
post %>% as.data.frame() %>%
pivot_longer(starts_with("a")) %>%
mutate(name = recode(name, !!!labels)) %>%
ggplot(aes(x = value, color = name)) +
geom_density(linewidth = 2) +
labs(title = "Posterior distribution of expected milk energy")post <- extract.samples( m2 )
a.1 = rnorm(1e4, post$a[,1], post$sigma)
a.2 = rnorm(1e4, post$a[,2], post$sigma)
a.3 = rnorm(1e4, post$a[,3], post$sigma)
a.4 = rnorm(1e4, post$a[,4], post$sigma)
data.frame(a.1, a.2, a.3, a.4) %>%
pivot_longer(everything()) %>%
mutate(name = recode(name, !!!labels)) %>%
ggplot(aes(x = value, color = name)) +
geom_density(linewidth = 2) +
labs(title = "Posterior distribution of predicted milk energy")Let’s return to the weight example. What if we want to control for height?
\[\begin{align*} w_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{S[i]} + \beta_{S[i]}(H_i-\bar{H})\\ \alpha_j &\sim \text{Normal}(130, 20)\text{ for }j = 1..2 \\ \beta_j &\sim \text{Normal}(0, 25)\text{ for }j = 1..2 \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]
mean sd 5.5% 94.5%
a[1] 99.471403 0.9579446 97.940423 101.002384
a[2] 99.607748 1.0007779 98.008311 101.207184
b[1] 43.338468 4.0421564 36.878321 49.798614
b[2] 40.107233 3.6519552 34.270704 45.943763
sigma 9.323176 0.3515417 8.761345 9.885008
List of 3
$ sigma: num [1:10000] 9.64 9.32 9.6 9.64 9.8 ...
$ a : num [1:10000, 1:2] 98.9 98.5 99.5 99.2 99.3 ...
$ b : num [1:10000, 1:2] 39.2 38.5 35 44.7 40.1 ...
- attr(*, "source")= chr "quap posterior: 10000 samples from m3"
Plot the slopes using extract.samples()
xbar = mean(d$height) # need this because we centered
post <- extract.samples(m3) # sample intercepts and slopes from the posterior
plot(d$weight ~ d$height, cex=0.5, pch=16, col=col.alpha("darkgrey",0.5),
xlab = "height", ylab = "weight")
#plot the lines implied by the first 50 draws from the posterior
for(i in 1:50){
curve(post$a[i, 1] +post$b[i, 1]*(x-xbar),
add = T,
col=col.alpha("#1c5253",0.1))
curve(post$a[i, 2] +post$b[i, 2]*(x-xbar),
add = T,
col=col.alpha("#e07a5f",0.1))
}Plot the slopes using link(). (Run this yourself and open up the objects muF and muM to determine what the link() function is doing.)
xseq <- seq( min(d$height), max(d$height), len=100) # some values for X
plot(d$weight ~ d$height, cex=0.5, pch=16, col=col.alpha("darkgrey",0.3),
xlim = range(d$height), ylim = range(d$weight),
xlab = "height", ylab = "weight")
muF <- link(m3, data=list(sex=rep(1,100), height=xseq, Hbar = mean(d$height)))
lines(xseq, apply(muF, 2, mean), lwd = 2, col = "#1c5253" )
muM <- link(m3, data=list(sex=rep(2,100), height=xseq, Hbar = mean(d$height)))
lines(xseq, apply(muM, 2, mean), lwd = 2, col = "#e07a5f")Return to the milk data. Write a mathematical model expressing the energy of milk as a function of the species body mass (mass) and clade category. Be sure to include priors. Fit your model using quap().
\[\begin{align*} K_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{\text{CLADE}[i]} + \beta_{\text{CLADE}[i]}(M-\bar{M})\\ \alpha_i &\sim \text{Normal}(0, 0.5) \text{ for }j=1..4 \\ \beta_i &\sim \text{Normal}(0, 0.5) \text{ for }j=1..4 \\ \sigma &\sim \text{Exponential}(1) \\ \end{align*}\]
dat <- list(
K = standardize(milk$kcal.per.g),
M = milk$mass,
Mbar = mean(milk$mass),
clade_id = milk$clade_id
)
flist <- alist(
K ~ dnorm( mu , sigma ) ,
mu <- a[clade_id] +b[clade_id]*(M-Mbar),
a[clade_id] ~ dnorm( 0 , 0.5 ) ,
b[clade_id] ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
)
m4 <- quap(
flist, data = dat
) mean sd 5.5% 94.5%
a[1] -0.434267103 0.261119127 -0.85158590 -0.016948306
a[2] -0.282796279 0.478554305 -1.04761849 0.482025928
a[3] 0.368829602 0.418885891 -0.30063096 1.038290160
a[4] -0.005063270 0.498109189 -0.80113796 0.791011419
b[1] -0.002670452 0.007183324 -0.01415079 0.008809887
b[2] -0.061304007 0.040137293 -0.12545115 0.002843139
b[3] -0.047925963 0.050277759 -0.12827953 0.032427607
b[4] 0.064915884 0.046232350 -0.00897234 0.138804109
sigma 0.692512121 0.092025313 0.54543790 0.839586344
xseq <- seq( min(milk$mass), max(milk$mass), len=100)
Mbar = mean(milk$mass)
custom_colors = c("#1c5253", "#e07a5f", "#f2cc8f", "#81b29a")
colors = custom_colors[milk$clade_id]
plot(milk$K ~ milk$mass, col = colors,
pch = 16,
xlim = range(milk$mass), ylim = range(milk$K),
xlab = "height", ylab = "weight")
mu1 <-
link(m4, data=list(clade_id=rep(1,100), M=xseq, Mbar = Mbar))
lines(xseq, apply(mu1, 2, mean), lwd = 2, col = "#1c5253" )
mu2 <-
link(m4, data=list(clade_id=rep(2,100), M=xseq, Mbar = Mbar))
lines(xseq, apply(mu2, 2, mean), lwd = 2, col = "#e07a5f" )
mu3 <-
link(m4, data=list(clade_id=rep(3,100), M=xseq, Mbar = Mbar))
lines(xseq, apply(mu3, 2, mean), lwd = 2, col = "#f2cc8f" )
mu4 <-
link(m4, data=list(clade_id=rep(4,100), M=xseq, Mbar = Mbar))
lines(xseq, apply(mu4, 2, mean), lwd = 2, col = "#81b29a" )
legend("topright", legend = levels(milk$clade),
col = custom_colors, pch = 16)